Papers Recommended

【TASLP】Audio-Visual Cross-Attention Network for Robotic Speaker Tracking

Publication Date: 2023-06-16


Shared by: Yidi Li
Research direction: Audio-Visual Localization and Tracking
Title: Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
Authors: Xinyuan Qian, Zhengdong Wang, Jiadong Wang, Guohui Guan, Haizhou Li
Institution: University of Science and Technology Beijing; The Chinese University of Hong Kong, Shenzhen; National University of Singapore
Abstract: Audio-visual signals can be used jointly for robotic perception as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, as an essential robotic function, was traditionally solved as a signal processing problem that now increasingly finds deep learning solutions. The question is how to fuse audio-visual signals in an effective way. Speaker tracking is not only more desirable, but also potentially more accurate than speaker localization, because it exploits the speaker's temporal motion dynamics for smoothed trajectory estimation. However, due to the lack of large annotated datasets, speaker tracking is not as well studied as speaker localization. In this paper, we study robotic speaker Direction of Arrival (DoA) estimation with a focus on audio-visual fusion and tracking methodology. We propose a Cross-Modal Attentive Fusion (CMAF) mechanism, which uses self-attention to learn intra-modal temporal dependencies and a cross-attention mechanism for inter-modal alignment. We also collect a realistic dataset on a robotic platform to support the study. The experimental results demonstrate that our proposed network outperforms the state-of-the-art audio-visual localization and tracking methods under noisy conditions, with accuracy improvements of 5.82% and 3.62%, respectively, at SNR = −20 dB.
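
To make the fusion idea in the abstract concrete, below is a minimal, illustrative PyTorch sketch of a cross-modal attentive fusion block: self-attention over each modality's own temporal sequence, followed by cross-attention in which each modality queries the other for alignment. This is not the authors' released implementation; the class name CrossModalAttentiveFusion, the feature dimension, the number of heads, and the final linear fusion layer are assumptions made for illustration.

import torch
import torch.nn as nn

class CrossModalAttentiveFusion(nn.Module):
    """Illustrative sketch (assumed design, not the paper's exact network):
    self-attention captures intra-modal temporal dependencies, and
    cross-attention aligns one modality's features with the other's."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Self-attention over each modality's own frame sequence.
        self.audio_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention: each modality queries the other for inter-modal alignment.
        self.audio_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed fusion head: concatenate the two aligned streams and project.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat, visual_feat: (batch, time, dim) frame-level embeddings.
        a, _ = self.audio_self_attn(audio_feat, audio_feat, audio_feat)
        v, _ = self.visual_self_attn(visual_feat, visual_feat, visual_feat)
        # Audio queries visual features, and vice versa.
        a2v, _ = self.audio_cross_attn(a, v, v)
        v2a, _ = self.visual_cross_attn(v, a, a)
        # Fused representation that a downstream head could map to a DoA estimate.
        return self.fuse(torch.cat([a2v, v2a], dim=-1))

if __name__ == "__main__":
    block = CrossModalAttentiveFusion()
    audio = torch.randn(2, 10, 256)   # (batch, time, dim)
    visual = torch.randn(2, 10, 256)
    print(block(audio, visual).shape)  # torch.Size([2, 10, 256])

In this sketch the fused sequence keeps the input time resolution, so a tracking head could consume it frame by frame to smooth the estimated trajectory.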